
Regular expression selector

Overview

The regular expression selector is the most powerful selector in pdf2Data's toolbox. Unsurprisingly, it is also the least user-friendly.

It implements a standard regular expression search, so using it requires knowledge of regular expression syntax.

tip

Most of the data you need from a PDF can be extracted without this selector; see Getting started for example usage. However, if you are passionate about regular expressions, the regular expression selector is all you need for data extraction.

caution

You can specify a two-line regular expression; however, in most cases the Paragraph selector can be used instead.

Parameters

Pattern

This selector has only one mandatory parameter, Pattern, which contains the regular expression to search for in a PDF. The regular expression may also contain groups, defined within round brackets. In this case, only the string captured by the group within the brackets is extracted.

For example, the pattern Invoice\s+(\d{3}) returns a 3-digit number that appears after the word "Invoice". The number must be separated from "Invoice" by one or more whitespace characters.
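To try the pattern outside pdf2Data, the following minimal sketch uses Python's built-in re module. It only illustrates how the capturing group behaves and is not pdf2Data's own API. The function name extract_invoice_number and the sample strings are hypothetical, and the regular expression flavor used by pdf2Data may differ from Python's in some details.

```python
import re

# The pattern from the example above: the word "Invoice", one or more
# whitespace characters, then a capturing group of exactly three digits.
PATTERN = re.compile(r"Invoice\s+(\d{3})")


def extract_invoice_number(text):
    """Return the text captured by the group, or None if nothing matches.

    Hypothetical helper for illustration only; not part of pdf2Data.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # group(1) is the part inside the round brackets, not the whole match.
    return match.group(1)


if __name__ == "__main__":
    # Sample strings are made up for this sketch.
    print(extract_invoice_number("Invoice   123 issued in January"))
    print(extract_invoice_number("Invoice123"))
    print(extract_invoice_number("Invoice 12345"))
```

Note that \d{3} is not anchored, so a longer number such as 12345 yields only its first three digits. If that is not wanted, a word boundary (\b) after the group is one common way to require exactly three digits.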

Result overview

The resulting text is presented as lines (see the output types in Picker selector).

important

The format and an example of the actual result produced by the pdf2Data Engine are described in Recognition result specification.

Specification

For more information about properties and expert usage, visit the specification page.